Abstract



The multi-armed bandit problem is a popular model for studying exploration/exploitation trade-off in sequential 
decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, 
given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural 
Bayesian algorithm. The basic idea is to choose an arm to play according to its probability of being the best arm. 
Thompson Sampling algorithm has experimentally been shown to be close to optimal. In addition, it is efficient to 
implement and exhibits several desirable properties such as small regret for delayed feedback. However, theoretical 
understanding of this algorithm was quite limited. In this paper, for the first time, we show that Thompson Sampling 
algorithm achieves logarithmic expected regret for the stochastic multi-armed bandit problem. More precisely, for 
the stochastic two-armed bandit problem, the expected regret in time T is $O(\frac{\ln T}{\Delta} + \frac{1}{\Delta^3})$. And, for the stochastic
$N$-armed bandit problem, the expected regret in time T is $O([(\sum_{i=2}^{N} \frac{1}{\Delta_i^2})^2] \ln T)$. Our bounds are optimal but for
the dependence on $\Delta_i$ and the constant factors in big-Oh.

1 Introduction 

The multi-armed bandit (MAB) problem models the exploration/exploitation trade-off inherent in sequential decision
problems. Many versions and generalizations of the multi-armed bandit problem have been studied in the literature; in
this paper we will consider a basic and well-studied version of this problem: the stochastic multi-armed bandit problem.
Among the many algorithms available for the stochastic bandit problem, some popular ones include the Upper Confidence
Bound (UCB) family of algorithms (e.g., [9, 1], and more recently [3], [10], [8]), which have good theoretical
guarantees, and the algorithm by [4], which gives an optimal strategy under the Bayesian setting with known priors and
geometric time-discounted rewards. In one of the earliest works on stochastic bandit problems, [14] proposed a natural
randomized Bayesian algorithm to minimize regret. The basic idea is to assume a simple prior distribution on the
parameters of the reward distribution of every arm, and at any time step, play an arm according to its posterior
probability of being the best arm. This algorithm is known as Thompson Sampling (TS), and it is a member of the family
of randomized probability matching algorithms. We emphasize that although the TS algorithm is a Bayesian approach, the
description of the algorithm and our analysis apply to the prior-free stochastic multi-armed bandit model where the
parameters of the reward distribution of every arm are fixed, though unknown (refer to Section 1.1). One could think of
the "assumed" Bayesian priors as a tool employed by the TS algorithm to encode the current knowledge about the arms.
Thus, our regret bounds for Thompson Sampling are directly comparable to the regret bounds for the UCB family of
algorithms, which are a frequentist approach to the same problem.

Recently, TS has attracted considerable attention. Several studies (e.g., [6, 13, 2, 12]) have empirically
demonstrated the efficacy of Thompson Sampling: [13] provides a detailed discussion of probability matching techniques
in many general settings along with favorable empirical comparisons with other techniques. [2] demonstrate that
empirically TS achieves regret comparable to the lower bound of [9]; and in applications like display advertising and
news article recommendation, it is competitive with or better than popular methods such as UCB. In their experiments,
TS is also more robust to delayed or batched feedback than the other methods (delayed feedback means that the result of
a play of an arm may become available only after some time delay, but we are required to make immediate decisions for
which arm to play next). A possible explanation may be that TS is a randomized algorithm and so it is unlikely to
get trapped in an early bad decision during the delay. Microsoft's adPredictor ([5]) for CTR prediction of search ads
on Bing uses the idea of Thompson Sampling.

It has been suggested ([2]) that despite being easy to implement and being competitive with the state of the art
methods, the reason TS is not very popular in the literature could be its lack of strong theoretical analysis. Existing
theoretical analyses in [6, 11] provide weak guarantees, namely, a bound of $o(T)$ on expected regret in time T. In this
paper, for the first time, we provide a logarithmic bound on the expected regret of the TS algorithm in time T that is
close to the lower bound of [9]. Before stating our results, we describe the MAB problem and the TS algorithm formally.



1.1 The multi-armed bandit problem 

We consider the stochastic multi-armed bandit (MAB) problem: We are given a slot machine with N arms; at each
time step t = 1, 2, 3, ..., one of the N arms must be chosen to be played. Each arm i, when played, yields a random
real-valued reward according to some fixed (unknown) distribution with support in [0, 1]. The random rewards obtained
from playing an arm repeatedly are i.i.d. and independent of the plays of the other arms. The reward is observed
immediately after playing the arm.

An algorithm for the MAB problem must decide which arm to play at each time step t, based on the outcomes
of the previous $t - 1$ plays. Let $\mu_i$ denote the (unknown) expected reward for arm i. A popular goal is to maximize
the expected total reward in time T, i.e., $E[\sum_{t=1}^{T} \mu_{i(t)}]$, where $i(t)$ is the arm played in step t, and the
expectation is over the random choices of $i(t)$ made by the algorithm. It is more convenient to work with the equivalent
measure of expected total regret: the amount we lose because of not playing the optimal arm in each step. To formally
define regret, let us introduce some notation. Let $\mu^* := \max_i \mu_i$, and $\Delta_i := \mu^* - \mu_i$. Also, let $k_i(t)$ denote the
number of times arm i has been played up to step $t - 1$. Then the expected total regret in time T is given by

$$E[R(T)] = E\Big[\sum_{t=1}^{T} (\mu^* - \mu_{i(t)})\Big] = \sum_i \Delta_i \cdot E[k_i(T)].$$

Other performance measures include PAC-style guarantees; we do not consider those measures here. 
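As a small illustration of the regret decomposition above, the following sketch computes the sum of gap times expected play count over all arms. The function name and the example numbers are assumptions made for illustration; they do not come from the paper.

```python
def expected_regret(means, expected_plays):
    """Sum over arms of Delta_i * E[k_i(T)], with Delta_i = mu* - mu_i."""
    best = max(means)
    return sum((best - mu) * k for mu, k in zip(means, expected_plays))


# Hypothetical example: two arms with means 0.5 and 0.25,
# where the suboptimal arm is played 40 times out of 1000 steps.
print(expected_regret([0.5, 0.25], [960, 40]))  # prints 10.0
```

Only plays of suboptimal arms contribute, which is why the analysis later concentrates on bounding the number of such plays.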



1.2 Thompson Sampling 

For simplicity of discussion, we first provide the details of the Thompson Sampling algorithm for the Bernoulli bandit
problem, i.e. when the rewards are either 0 or 1, and for arm i the probability of success (reward = 1) is $\mu_i$. This
description of Thompson Sampling follows closely that of [2]. Next, we propose a simple new extension of this
algorithm to general reward distributions with support [0, 1], which will allow us to seamlessly extend our analysis for
Bernoulli bandits to the general stochastic bandit problem.

The algorithm for Bernoulli bandits maintains Bayesian priors on the Bernoulli means $\mu_i$'s. The beta distribution turns
out to be a very convenient choice of priors for Bernoulli rewards. Let us briefly recall that beta distributions form a
family of continuous probability distributions on the interval (0, 1). The pdf of Beta($\alpha$, $\beta$), the beta distribution with
parameters $\alpha > 0$, $\beta > 0$, is given by $f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$. The mean of Beta($\alpha$, $\beta$) is $\alpha/(\alpha + \beta)$;
and as is apparent from the pdf, the higher $\alpha$ and $\beta$ are, the tighter the concentration of Beta($\alpha$, $\beta$) around the mean. The beta
distribution is useful for Bernoulli rewards because if the prior is a Beta($\alpha$, $\beta$) distribution, then after observing a
Bernoulli trial, the posterior distribution is simply Beta($\alpha + 1$, $\beta$) or Beta($\alpha$, $\beta + 1$), depending on whether the trial
resulted in a success or failure, respectively.
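The conjugate update just described amounts to incrementing one of two counters. A minimal sketch follows; the helper names are hypothetical and chosen only for this example.

```python
def beta_update(alpha, beta, reward):
    """Posterior parameters after one Bernoulli trial (reward is 0 or 1)."""
    return (alpha + 1, beta) if reward == 1 else (alpha, beta + 1)


def beta_mean(alpha, beta):
    """Mean of the Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from the uniform prior Beta(1, 1) and observe three successes and one failure.
params = (1, 1)
for r in [1, 1, 0, 1]:
    params = beta_update(*params, r)
print(params, beta_mean(*params))
```

After these four trials the posterior is Beta(4, 2), whose mean 2/3 lies between the empirical frequency 3/4 and the prior mean 1/2.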

The Thompson Sampling algorithm initially assumes arm i to have prior Beta(1, 1) on $\mu_i$, which is natural because
Beta(1, 1) is the uniform distribution on (0, 1). At time t, having observed $S_i(t)$ successes (reward = 1) and $F_i(t)$
failures (reward = 0) in $k_i(t) = S_i(t) + F_i(t)$ plays of arm i, the algorithm updates the distribution on $\mu_i$ as
Beta($S_i(t) + 1$, $F_i(t) + 1$). The algorithm then samples from these posterior distributions of the $\mu_i$'s, and plays an arm
according to the probability of its mean being the largest. We summarize the Thompson Sampling algorithm below.
Algorithm 1: Thompson Sampling for Bernoulli bandits

For each arm i = 1, . . . , N, set $S_i = 0$, $F_i = 0$.
foreach t = 1, 2, . . . , do
    For each arm i = 1, . . . , N, sample $\theta_i(t)$ from the Beta($S_i + 1$, $F_i + 1$) distribution.
    Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $r_t$.
    If $r_t = 1$, then $S_{i(t)} = S_{i(t)} + 1$, else $F_{i(t)} = F_{i(t)} + 1$.
end
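The steps of Algorithm 1 can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the authors' code: the true means are passed in only so that rewards can be simulated, arms are indexed from 0, and the function name and the `seed` parameter are illustrative.

```python
import random


def thompson_sampling_bernoulli(means, horizon, seed=0):
    """Simulate Thompson Sampling with Beta(1, 1) priors on Bernoulli arms.

    `means` holds the true success probabilities, used only to draw rewards.
    Returns per-arm play counts, successes, and failures.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    plays = [0] * n_arms
    for _ in range(horizon):
        # Sample theta_i(t) from the current posterior of each arm.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        # Observe a Bernoulli reward from the chosen arm.
        reward = 1 if rng.random() < means[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
        plays[arm] += 1
    return plays, successes, failures


plays, successes, failures = thompson_sampling_bernoulli([0.9, 0.1], horizon=2000)
print(plays)
```

With a large gap between the two means, almost all plays should go to the better arm; the exact counts depend on the random seed.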

We adapt the Bernoulli Thompson Sampling algorithm to the general stochastic bandits case, i.e. when the rewards
for arm i are generated from an arbitrary unknown distribution with support [0, 1] and mean $\mu_i$, in a way that allows
us to reuse our analysis of the Bernoulli case. To our knowledge, this adaptation is new. We modify TS so that after
observing the reward $\tilde{r}_t \in [0, 1]$ at time t, it performs a Bernoulli trial with success probability $\tilde{r}_t$. Let random variable
$r_t$ denote the outcome of this Bernoulli trial, and let $\{S_i(t), F_i(t)\}$ denote the number of successes and failures in
the Bernoulli trials until time t. The remaining algorithm is the same as for Bernoulli bandits. Algorithm 2 gives the
precise description of this algorithm.

We observe that the probability of observing a success (i.e., $r_t = 1$) in the Bernoulli trial after playing an arm i in
the new generalized algorithm is equal to the mean reward $\mu_i$. Let $f_i$ denote the (unknown) pdf of the reward distribution
for arm i. Then, on playing arm i,

$$\Pr(r_t = 1) = \int_0^1 \tilde{r} f_i(\tilde{r}) \, d\tilde{r} = \mu_i.$$

Thus, the probability of observing $r_t = 1$ is the same and $S_i(t)$, $F_i(t)$ evolve exactly in the same way as in the case of
Bernoulli bandits with mean $\mu_i$. Therefore, the analysis of TS for the Bernoulli setting is applicable to this modified TS
for the general setting. This allows us to replace, for the purpose of analysis, the problem with general stochastic
bandits with Bernoulli bandits with the same means. We use this observation to confine the proofs in this paper to the
case of Bernoulli bandits only.



Algorithm 2: Thompson Sampling for general stochastic bandits

For each arm i = 1, . . . , N, set $S_i = 0$, $F_i = 0$.
foreach t = 1, 2, . . . , do
    For each arm i = 1, . . . , N, sample $\theta_i(t)$ from the Beta($S_i + 1$, $F_i + 1$) distribution.
    Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $\tilde{r}_t$.
    Perform a Bernoulli trial with success probability $\tilde{r}_t$ and observe output $r_t$.
    If $r_t = 1$, then $S_{i(t)} = S_{i(t)} + 1$, else $F_{i(t)} = F_{i(t)} + 1$.
end
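Algorithm 2 differs from the Bernoulli version only in the extra Bernoulli trial on the observed reward. A hedged sketch is given below; the `reward_samplers` interface (one callable per arm that takes a `random.Random` and returns a reward in [0, 1]) is an assumption introduced for this example, and arms are indexed from 0.

```python
import random


def thompson_sampling_general(reward_samplers, horizon, seed=0):
    """Simulate Thompson Sampling for rewards in [0, 1] via a Bernoulli trial.

    `reward_samplers[i](rng)` must return a reward in [0, 1] for arm i.
    Returns per-arm play counts, successes, and failures of the Bernoulli trials.
    """
    rng = random.Random(seed)
    n_arms = len(reward_samplers)
    successes = [0] * n_arms
    failures = [0] * n_arms
    plays = [0] * n_arms
    for _ in range(horizon):
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = reward_samplers[arm](rng)           # real-valued reward in [0, 1]
        trial = 1 if rng.random() < reward else 0    # Bernoulli trial with that success probability
        if trial == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
        plays[arm] += 1
    return plays, successes, failures
```

For example, passing `lambda rng: rng.uniform(0.6, 1.0)` and `lambda rng: rng.uniform(0.0, 0.4)` as the two samplers gives arms with means 0.8 and 0.2, and the posterior updates behave like those of Bernoulli arms with the same means.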



1.3 Our results 

In this article, we bound the finite time expected regret of Thompson Sampling. From now on we will assume that the
first arm is the unique optimal arm, i.e., $\mu^* = \mu_1 > \max_{i \neq 1} \mu_i$. Assuming that the first arm is an optimal arm
is a matter of convenience for stating the results and for the analysis. The assumption of a unique optimal arm is also
without loss of generality, since adding more arms with $\mu_i = \mu^*$ can only decrease the expected regret; details of this
argument are provided in Appendix A.



Theorem 1. For the two-armed stochastic bandit problem (N = 2), Thompson Sampling algorithm has expected
regret

$$E[R(T)] = O\Big(\frac{\ln T}{\Delta} + \frac{1}{\Delta^3}\Big)$$

in time T, where $\Delta = \mu_1 - \mu_2$.

Theorem 2. For the N-armed stochastic bandit problem, Thompson Sampling algorithm has expected regret

$$E[R(T)] \le O\Big(\Big(\sum_{a=2}^{N} \frac{1}{\Delta_a^2}\Big)^2 \ln T\Big)$$

in time T, where $\Delta_i = \mu_1 - \mu_i$.



Remark 1. For the N-armed bandit problem, we can obtain an alternate bound of

$$E[R(T)] \le O\Big(\frac{\Delta_{\max}}{\Delta_{\min}^3} \Big(\sum_{a=2}^{N} \frac{1}{\Delta_a^2}\Big) \ln T\Big)$$

by slight modification to the proof. The above bound has a better dependence on N than in Theorem 2, but worse
dependence on $\Delta_i$'s. Here $\Delta_{\min} = \min_{i \neq 1} \Delta_i$, $\Delta_{\max} = \max_{i \neq 1} \Delta_i$.

In the interest of readability, we used big-Oh notation to state our results. The exact constants are provided in the
proofs of the above theorems. Let us contrast our bounds with the previous work. [9] proved the following lower
bound on regret of any bandit algorithm:

$$E[R(T)] \ge \Big[\sum_{i=2}^{N} \frac{\Delta_i}{D(\mu_i \| \mu_1)} + o(1)\Big] \ln T,$$



where D denotes the KL divergence. They also gave algorithms asymptotically achieving this guarantee, though
unfortunately their algorithms are not efficient. [1] gave the UCB1 algorithm, which is efficient and achieves the
following bound:

$$E[R(T)] \le \Big[8 \sum_{i=2}^{N} \frac{1}{\Delta_i}\Big] \ln T + \Big(1 + \frac{\pi^2}{3}\Big) \Big(\sum_{i=2}^{N} \Delta_i\Big).$$

For many settings of the parameters, the bound of Auer et al. is not far from the lower bound of Lai and Robbins.
Our bounds are optimal in terms of dependence on T, but inferior in terms of the constant factors and dependence on
$\Delta$. We note that for the two-armed case our bound closely matches the bound of [1]. For the N-armed setting, the
exponent of $\Delta$'s in our bound is basically 4 compared to the exponent 1 for UCB1.
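To get a feel for how the two earlier bounds compare, the following sketch evaluates the coefficients of ln T in the lower bound and in the UCB1 bound for Bernoulli arms. The function names are illustrative, the o(1) term and UCB1's additive term are ignored, and all means are assumed to lie strictly between 0 and 1 with a unique best arm.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lower_bound_coefficient(means):
    """Coefficient of ln T in the lower bound: sum of Delta_i / D(mu_i || mu_1)."""
    best = max(means)
    return sum((best - mu) / kl_bernoulli(mu, best) for mu in means if mu < best)


def ucb1_coefficient(means):
    """Coefficient of ln T in the UCB1 bound: 8 * sum of 1 / Delta_i."""
    best = max(means)
    return 8 * sum(1 / (best - mu) for mu in means if mu < best)


means = [0.9, 0.8, 0.5]
print(lower_bound_coefficient(means), ucb1_coefficient(means))
```

Both coefficients grow as the gaps shrink, but the UCB1 coefficient is larger by a constant factor that depends on the means.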

More recently, [8] gave the Bayes-UCB algorithm, which achieves regret bounds close to the lower bound of [9] for
Bernoulli rewards. Bayes-UCB is a UCB-like algorithm, where the upper confidence bounds are based on the quantiles
of Beta posterior distributions. Interestingly, these upper confidence bounds turn out to be similar to those used by
algorithms in [3] and [10]. Bayes-UCB can be seen as a hybrid of TS and UCB. However, the general structure of
the arguments used in [8] is similar to [1]; for the analysis of Thompson Sampling we need to deal with additional
difficulties, as discussed in the next section.



2 Proof Techniques 

In this section, we give an informal description of the techniques involved in our analysis. We hope that this will aid 
in reading the proofs, though this section is not essential for the sequel. We assume that all arms are Bernoulli arms, 
and that the first arm is the unique optimal arm. As explained in the previous sections, these assumptions are without 
loss of generality. 

Main technical difficulties. Thompson Sampling is a randomized algorithm which achieves exploration by choosing 
to play the arm with best sampled mean, among those generated from beta distributions around the respective empirical 
means. The beta distribution becomes more and more concentrated around the empirical mean as the number of plays 
of an arm increases. This randomized setting is unlike the algorithms in UCB family, which achieve exploration by 
adding a deterministic, non-negative bias inversely proportional to the number of plays, to the observed empirical 
means. Analysis of TS poses difficulties that seem to require new ideas. 

For example, the following general line of reasoning is used to analyze the regret of UCB-like algorithms in the two-arms
setting (for example, in [1]): once the second arm has been played a sufficient number of times, its empirical mean is
tightly concentrated around its actual mean. If the first arm has been played a sufficiently large number of times by
then, it will have an empirical mean close to its actual mean and larger than that of the second arm. Otherwise, if it
has been played a small number of times, its non-negative bias term will be large. Consequently, once the second arm has
been played a sufficient number of times, it will be played with very small probability (inverse polynomial of time)
regardless of the number of times the first arm has been played so far.

¹ For any two functions $f(n)$, $g(n)$, $f(n) = O(g(n))$ if there exist two constants $n_0$ and $c$ such that for all $n \ge n_0$, $f(n) \le c\,g(n)$.

However, for Thompson Sampling, if the number of previous plays of the first arm is small, then the probability of
playing the second arm could be as large as a constant even if it has already been played a large number of times. For
instance, if the first arm has not been played at all, then $\theta_1(t)$ is a uniform random variable, and thus $\theta_1(t) < \theta_2(t)$
with probability $\theta_2(t) \approx \mu_2$. As a result, in our analysis we need to carefully consider the distribution of the number
of previous plays of the first arm, in order to bound the probability of playing the second arm.

The observation just mentioned also points to a challenge in extending the analysis of TS for the two-armed bandit to
the general N-armed bandit setting. One might consider analyzing the regret in the N-armed case by considering only
two arms at a time: the first arm and one of the suboptimal arms. We could use the observation that the probability
of playing a suboptimal arm is bounded by the probability of it exceeding the first arm. However, this probability also
depends on the number of previous plays of the two arms, which in turn depend on the plays of the other arms. Again,
[1], in their analysis of the UCB algorithm, overcome this difficulty by bounding this probability for all possible numbers
of previous plays of the first arm, and large enough plays of the suboptimal arm. For Thompson Sampling, due to
the observation made earlier, the (distribution of the) number of previous plays of the first arm needs to be carefully
accounted for, which in turn requires considering all the arms at the same time, thereby leading to a more involved
analysis.

Proof outline for two arms setting. Let us first consider the special case of two arms which is simpler than the
general N arms case. Firstly, we note that it is sufficient to bound the regret incurred during the time steps after the
second arm has been played $L = 24(\ln T)/\Delta^2$ times. The expected regret before this event is bounded by $24(\ln T)/\Delta$
because only the plays of the second arm produce an expected regret of $\Delta$; regret is 0 when the first arm is played.
Next, we observe that after the second arm has been played L times, the following happens with high probability:
the empirical average reward of the second arm from each play is very close to its actual expected reward $\mu_2$, and its
beta distribution is tightly concentrated around $\mu_2$. This means that, thereafter, the first arm would be played at time
t if $\theta_1(t)$ turns out to be greater than (roughly) $\mu_2$. This observation allows us to model the number of steps between
two consecutive plays of the first arm as a geometric random variable with parameter close to $\Pr[\theta_1(t) > \mu_2]$. To be
more precise, given that there have been j plays of the first arm with $s(j)$ successes and $f(j) = j - s(j)$ failures, we
want to estimate the expected number of steps before the first arm is played again (not including the steps in which
the first arm is played). This is modeled by a geometric random variable $X(j, s(j), \mu_2)$ with parameter $\Pr[\theta_1 > \mu_2]$,
where $\theta_1$ has distribution Beta($s(j) + 1$, $j - s(j) + 1$), and thus $E[X(j, s(j), \mu_2) \mid s(j)] = 1/\Pr[\theta_1 > \mu_2] - 1$.
To bound the overall expected number of steps between the $j^{th}$ and $(j+1)^{th}$ play of the first arm, we need to take
into account the distribution of the number of successes $s(j)$. For large j, we use Chernoff-Hoeffding bounds to say
that $s(j)/j \approx \mu_1$ with high probability, and moreover $\theta_1$ is concentrated around its mean, and thus we get a good
estimate of $E[E[X(j, s(j), \mu_2) \mid s(j)]]$. However, for small j we do not have such concentration, and it requires a
delicate computation to get a bound on $E[E[X(j, s(j), \mu_2) \mid s(j)]]$. The resulting bound on the expected number of
steps between consecutive plays of the first arm bounds the expected number of plays of the second arm, to yield a
good bound on the regret for the two-arms setting.

Proof outline for N arms setting. At any step t, we divide the set of suboptimal arms into two subsets: saturated
and unsaturated. The set $C(t)$ of saturated arms at time t consists of arms a that have already been played a sufficient
number ($L_a := 24(\ln T)/\Delta_a^2$) of times, so that with high probability, $\theta_a(t)$ is tightly concentrated around $\mu_a$. As
earlier, we try to estimate the number of steps between two consecutive plays of the first arm. After the $j^{th}$ play, the
$(j+1)^{th}$ play of the first arm will occur at the earliest time t such that $\theta_1(t) > \theta_i(t), \forall i \neq 1$. The number of steps
before $\theta_1(t)$ is greater than $\theta_a(t)$ of all saturated arms $a \in C(t)$ can be closely approximated using a geometric
random variable with parameter close to $\Pr(\theta_1 \ge \max_{a \in C(t)} \mu_a)$, as before. However, even if $\theta_1(t)$ is greater than the
$\theta_a(t)$ of all saturated arms $a \in C(t)$, it may not get played due to the play of an unsaturated arm u with a greater $\theta_u(t)$.
Call this event an "interruption" by unsaturated arms. We show that if there have been j plays of the first arm with $s(j)$
successes, the expected number of steps until the $(j+1)^{th}$ play can be upper bounded by the product of the expected
value of a geometric random variable similar to $X(j, s(j), \max_a \mu_a)$ defined earlier, and the number of interruptions
by the unsaturated arms. Now, the total number of interruptions by unsaturated arms is bounded by $\sum_{u=2}^{N} L_u$ (since
an arm u becomes saturated after $L_u$ plays). The actual number of interruptions is hard to analyze due to the high
variability in the parameters of the unsaturated arms. We derive our bound assuming the worst case allocation of these
$\sum_u L_u$ interruptions. This step in the analysis is the main source of the high exponent of $\Delta$ in our regret bound for
the N-armed case compared to the two-armed case.



3 Regret bound for the two-armed bandit problem 

In this section, we present a proof of Theorem 1, our result for the two-armed bandit problem. Recall our assumption
that all arms have Bernoulli distribution on rewards, and that the first arm is the unique optimal arm.

Let random variable $j_0$ denote the number of plays of the first arm until $L = 24(\ln T)/\Delta^2$ plays of the second
arm. Let random variable $t_j$ denote the time step at which the $j^{th}$ play of the first arm happens (we define $t_0 = 0$).
Also, let random variable $Y_j = t_{j+1} - t_j - 1$ measure the number of time steps between the $j^{th}$ and $(j+1)^{th}$ plays
of the first arm (not counting the steps in which the $j^{th}$ and $(j+1)^{th}$ plays happened), and let $s(j)$ denote the number
of successes in the first j plays of the first arm. Then the expected number of plays of the second arm in time T is
bounded by

$$E[k_2(T)] \le L + E\Big[\sum_{j=j_0}^{T-1} Y_j\Big].$$

To understand the expectation of $Y_j$, it will be useful to define another random variable $X(j, s, y)$ as follows. We
perform the following experiment until it succeeds: check if a Beta($s + 1$, $j - s + 1$) distributed random variable
exceeds a threshold y. For each experiment, we generate the beta-distributed r.v. independently of the previous ones.
Now define $X(j, s, y)$ to be the number of trials before the experiment succeeds. Thus, $X(j, s, y)$ takes non-negative
integer values, and is a geometric random variable with parameter (success probability) $1 - F^{beta}_{s+1, j-s+1}(y)$. Here
$F^{beta}_{\alpha, \beta}$ denotes the cdf of the beta distribution with parameters $\alpha$, $\beta$. Also, let $F^{B}_{n,p}$ denote the cdf of the binomial
distribution with parameters $(n, p)$.

We will relate Y and X shortly. The following lemma provides a handle on the expectation of X.

Lemma 1. For all non-negative integers j, $s \le j$, and for all $y \in [0, 1]$,

$$E[X(j, s, y)] = \frac{1}{F^{B}_{j+1,y}(s)} - 1,$$

where $F^{B}_{n,p}$ denotes the cdf of the binomial distribution with parameters $(n, p)$.

Proof. By the well-known formula for the expectation of a geometric random variable and the definition of X we
have $E[X(j, s, y)] = \frac{1}{1 - F^{beta}_{s+1, j-s+1}(y)} - 1$. (The additive $-1$ is there because we do not count the final step where the
Beta r.v. is greater than y.) The lemma then follows from Fact 1 in Appendix B. □
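Lemma 1 rests on the standard identity between the beta tail with integer parameters and the binomial cdf, $1 - F^{beta}_{s+1, j-s+1}(y) = F^{B}_{j+1,y}(s)$. The sketch below checks this numerically for one choice of (j, s, y). The function names are illustrative and Simpson's rule is used here only as a convenient way to integrate the beta pdf; none of this is taken from the paper's appendix.

```python
import math


def binomial_cdf(n, p, k):
    """Pr(Binomial(n, p) <= k)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def beta_tail(j, s, y, steps=2000):
    """Pr(Beta(s + 1, j - s + 1) > y) by Simpson's rule (steps must be even)."""
    # Normalizing constant Gamma(j + 2) / (Gamma(s + 1) * Gamma(j - s + 1)).
    norm = (j + 1) * math.comb(j, s)

    def pdf(x):
        return norm * x**s * (1 - x)**(j - s)

    h = (1 - y) / steps
    total = pdf(y) + pdf(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(y + i * h)
    return total * h / 3


def expected_trials(j, s, y):
    """E[X(j, s, y)] using the binomial form stated in Lemma 1."""
    return 1 / binomial_cdf(j + 1, y, s) - 1


j, s, y = 5, 2, 0.3
print(beta_tail(j, s, y), binomial_cdf(j + 1, y, s), expected_trials(j, s, y))
```

The first two printed numbers should agree up to integration error, which is what allows the expectation of the geometric variable to be written in terms of the binomial cdf.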
Recall that $Y_j$ was defined as the number of steps before $\theta_1(t) > \theta_2(t)$ happens for the first time after the $j^{th}$ play
of the first arm. Now, consider the number of steps before $\theta_1(t) > \mu_2 + \frac{\Delta}{2}$ happens for the first time after the $j^{th}$ play
of the first arm. Given $s(j)$, this has the same distribution as $X(j, s(j), \mu_2 + \frac{\Delta}{2})$. However, $Y_j$ can be larger than this
number if (and only if) at some time step t between $t_j$ and $t_{j+1}$, $\theta_2(t) > \mu_2 + \frac{\Delta}{2}$. In that case we use the fact that $Y_j$
is always bounded by T. Thus, for any $j \ge j_0$, we can bound $E[Y_j]$ as,

$$E[Y_j] \le E[\min\{X(j, s(j), \mu_2 + \tfrac{\Delta}{2}), T\}] + E\Big[\sum_{t=t_j+1}^{t_{j+1}-1} T \cdot I(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2})\Big].$$

Here the notation $I(E)$ is the indicator for event E, i.e., its value is 1 if event E happens and 0 otherwise. In the first
term of the RHS, the expectation is over the distribution of $s(j)$ as well as over the distribution of the geometric variable
$X(j, s(j), \mu_2 + \frac{\Delta}{2})$. Since we are interested only in $j \ge j_0$, we will instead use the similarly obtained bound on
$E[Y_j \cdot I(j \ge j_0)]$,

$$E[Y_j \cdot I(j \ge j_0)] \le E[\min\{X(j, s(j), \mu_2 + \tfrac{\Delta}{2}), T\}] + E\Big[\sum_{t=t_j+1}^{t_{j+1}-1} T \cdot I(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2}) \cdot I(j \ge j_0)\Big].$$
This gives,

$$E\Big[\sum_{j=j_0}^{T-1} Y_j\Big] \le \sum_{j=0}^{T-1} E[Y_j \cdot I(j \ge j_0)]$$
$$\le \sum_{j=0}^{T-1} E[\min\{X(j, s(j), \mu_2 + \tfrac{\Delta}{2}), T\}] + T \cdot \sum_{t=1}^{T} \Pr(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2},\ k_2(t) \ge L).$$

The last inequality holds because for any $t \in [t_j + 1, t_{j+1} - 1]$, $j \ge j_0$, by definition $k_2(t) \ge L$. We denote the event
$\{\theta_2(t) \le \mu_2 + \frac{\Delta}{2} \text{ or } k_2(t) < L\}$ by $E_2(t)$. In words, this is the event that if a sufficient number of plays of the second
arm have happened until time t, then $\theta_2(t)$ is not much larger than $\mu_2$; intuitively, we expect this event to be a high
probability event, as we will show. $\overline{E_2(t)}$ is the event $\{\theta_2(t) > \mu_2 + \frac{\Delta}{2} \text{ and } k_2(t) \ge L\}$ used in the above equation.
Next, we bound $\Pr(E_2(t))$ and $E[\min\{X(j, s(j), \mu_2 + \frac{\Delta}{2}), T\}]$.

Lemma 2.

$$\forall t, \quad \Pr(E_2(t)) \ge 1 - \frac{2}{T^2}.$$

Proof. Refer to Appendix C.1. □



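Lemma 3. Consider any positive $y < \mu_1$, and let $\Delta' = \mu_1 - y$. Also, let $R = \frac{\mu_1(1-y)}{y(1-\mu_1)} > 1$, and let D denote the
KL-divergence between $\mu_1$ and y, i.e. $D = y \ln\frac{y}{\mu_1} + (1-y) \ln\frac{1-y}{1-\mu_1}$. Then

$$E[E[\min\{X(j, s(j), y), T\} \mid s(j)]] \le \begin{cases} 1 + \frac{2}{1-y} + \frac{\mu_1}{\Delta'} e^{-Dj}, & j < \frac{y}{D} \ln R, \\ 1 + \frac{R^y}{1-y} e^{-Dj} + \frac{\mu_1}{\Delta'} e^{-Dj}, & \frac{y}{D} \ln R \le j < \frac{4 \ln T}{\Delta'^2}, \\ \frac{16}{T}, & j \ge \frac{4 \ln T}{\Delta'^2}, \end{cases}$$

where the outer expectation is taken over $s(j)$ distributed as Binomial$(j, \mu_1)$.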



Proof. The complete proof of this lemma is included in Appendix C.2; here we provide some high level ideas.
Using Lemma 1, the expected value of $X(j, s(j), y)$ for any given $s(j)$ is

$$E[X(j, s(j), y) \mid s(j)] = \frac{1}{F^{B}_{j+1,y}(s(j))} - 1.$$

For large j, i.e., $j \ge 4(\ln T)/\Delta'^2$, we use Chernoff-Hoeffding bounds to argue that with probability at least $(1 - \frac{1}{T^2})$,
$s(j)$ will be greater than $\mu_1 j - \Delta' j/2$. And, for $s(j) \ge \mu_1 j - \Delta' j/2 = yj + \Delta' j/2$, we can show that the probability
$F^{B}_{j+1,y}(s(j))$ will be at least $1 - [\ldots]$, again using Chernoff-Hoeffding bounds. These observations allow us to derive
that $E[E[\min\{X(j, s(j), y), T\}]] \le \frac{16}{T}$, for $j \ge 4(\ln T)/\Delta'^2$.

For small j, the argument is more delicate. In this case, $s(j)$ could be small with a significant probability. More
precisely, $s(j)$ could take a value s smaller than $yj$ with binomial probability $f^{B}_{j,\mu_1}(s)$. For such s, we use the lower
bound $F^{B}_{j+1,y}(s) \ge (1-y) F^{B}_{j,y}(s) + y F^{B}_{j,y}(s-1) \ge (1-y) F^{B}_{j,y}(s) \ge (1-y) f^{B}_{j,y}(s)$, and then bound the ratio
$f^{B}_{j,\mu_1}(s) / f^{B}_{j,y}(s)$ in terms of $\Delta'$, R and KL-divergence D. For $s(j) = s \ge \lceil yj \rceil$, we use the observation that since
$\lceil yj \rceil$ is greater than or equal to the median of Binomial$(j, y)$ (see [7]), we have $F^{B}_{j,y}(s) \ge 1/2$. After some algebraic
manipulations, we get the result of the lemma. □
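The large-j step of the sketch above can be checked numerically: by Hoeffding's inequality, the probability that $s(j)$ falls below $\mu_1 j - \Delta' j/2$ is at most $\exp(-j\Delta'^2/2)$, which is at most $1/T^2$ once $j \ge 4(\ln T)/\Delta'^2$. The snippet below is an illustration with arbitrarily chosen values of T and $\Delta'$; the function name is hypothetical.

```python
import math


def hoeffding_lower_tail(j, deviation):
    """Hoeffding bound on Pr(s(j) <= (mu - deviation) * j) for j Bernoulli(mu) trials."""
    return math.exp(-2 * j * deviation ** 2)


T = 10_000          # assumed horizon
gap = 0.2           # assumed value of Delta'
j = math.ceil(4 * math.log(T) / gap ** 2)   # smallest "large" j
bound = hoeffding_lower_tail(j, gap / 2)
print(j, bound, 1 / T ** 2)
```

The threshold on j is exactly what is needed to push the tail probability down to $1/T^2$.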



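Using Lemma 2 and Lemma 3 for $y = \mu_2 + \Delta/2$, and $\Delta' = \Delta/2$, we can bound the expected number of plays of
the second arm as:

$$E[k_2(T)] \le L + E\Big[\sum_{j=j_0}^{T-1} Y_j\Big]$$
$$\le L + \sum_{j=0}^{T-1} E[E[\min\{X(j, s(j), \mu_2 + \tfrac{\Delta}{2}), T\} \mid s(j)]] + \sum_{t=1}^{T} T \cdot \Pr(\overline{E_2(t)})$$
$$\le L + \sum_{j=0}^{4(\ln T)/\Delta'^2 - 1} E[E[\min\{X(j, s(j), \mu_2 + \tfrac{\Delta}{2}), T\} \mid s(j)]] + 16 + 2$$
$$\le \frac{40 \ln T}{\Delta^2} + \frac{48}{\Delta^4} + 18,$$

where the last inequality is obtained after some algebraic manipulations; details are provided in Appendix C.3.
This gives a regret bound of

$$E[R(T)] = E[\Delta \cdot k_2(T)] \le \Big(\frac{40 \ln T}{\Delta} + \frac{48}{\Delta^3} + 18\Delta\Big).$$

Figure 1: Interval $I_j$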



4 Regret bound for the N-armed bandit problem

In this section, we prove Theorem 2, our result for the N-armed bandit problem. Again, we assume that all arms have
Bernoulli distribution on rewards, and that the first arm is the unique optimal arm.

At every time step t, we divide the set of suboptimal arms into saturated and unsaturated arms. We say that an arm
$i \neq 1$ is in the saturated set $C(t)$ at time t, if it has been played at least $L_i := \frac{24 (\ln T)}{\Delta_i^2}$ times before time t. We bound
the regret due to playing unsaturated and saturated suboptimal arms separately. The former is easily bounded as we
will see; most of the work is in bounding the latter. For this, we bound the number of plays of saturated arms between
two consecutive plays of the first arm.
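The saturation rule can be made concrete with a short sketch that computes the threshold $24(\ln T)/\Delta_i^2$ for each suboptimal arm and returns the arms whose play counts have reached it. The function names and the example values are assumptions for illustration, arms are indexed from 0 rather than 1, and a unique best arm is assumed.

```python
import math


def saturation_threshold(gap, horizon):
    """Number of plays after which an arm with the given gap counts as saturated."""
    return 24 * math.log(horizon) / gap ** 2


def saturated_set(means, plays, horizon):
    """Indices of suboptimal arms whose play count has reached their threshold."""
    best = max(means)
    return {
        i
        for i, (mu, k) in enumerate(zip(means, plays))
        if mu < best and k >= saturation_threshold(best - mu, horizon)
    }


# Hypothetical snapshot: arm 1 has a large gap and is saturated, arm 2 is not yet.
print(saturated_set([0.9, 0.5, 0.8], [900_000, 2_500, 2_500], horizon=10 ** 6))
```

Arms with small gaps need many more plays before they saturate, which is one way to see why the unsaturated arms are handled separately in the analysis.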

In the following, by an interval of time we mean a set of contiguous time steps. Let r.v. $I_j$ denote the interval
between (and excluding) the $j^{th}$ and $(j+1)^{th}$ plays of the first arm. We say that event $M(t)$ holds at time t, if $\theta_1(t)$
exceeds $\mu_i + \frac{\Delta_i}{2}$ of all the saturated arms, i.e.,

$$M(t) : \theta_1(t) > \max_{i \in C(t)} \mu_i + \frac{\Delta_i}{2}. \quad (2)$$

For t such that $C(t)$ is empty, we define $M(t)$ to hold trivially.

Let r.v. $\gamma_j$ denote the number of occurrences of event $M(t)$ in interval $I_j$:

$$\gamma_j = |\{t \in I_j : M(t) = 1\}|. \quad (3)$$

Events $M(t)$ divide $I_j$ into sub-intervals in a natural way: For $\ell = 2$ to $\gamma_j$, let r.v. $I_j(\ell)$ denote the sub-interval of $I_j$
between the $(\ell - 1)^{th}$ and $\ell^{th}$ occurrences of event $M(t)$ in $I_j$ (excluding the time steps in which event $M(t)$ occurs).
We also define $I_j(1)$ and $I_j(\gamma_j + 1)$: If $\gamma_j > 0$ then $I_j(1)$ denotes the sub-interval in $I_j$ before the first occurrence
of event $M(t)$ in $I_j$; and $I_j(\gamma_j + 1)$ denotes the sub-interval in $I_j$ after the last occurrence of event $M(t)$ in $I_j$. For
$\gamma_j = 0$ we have $I_j(1) = I_j$.

Figure 1 shows an example of interval $I_j$ along with sub-intervals $I_j(\ell)$; in this figure $\gamma_j = 4$.
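The way the occurrences of M(t) cut an interval into sub-intervals can be illustrated with a short sketch. The input is a list of booleans, one per time step of the interval, marking the steps at which M(t) holds; this representation and the function name are assumptions made for the example.

```python
def split_interval(m_flags):
    """Return (gamma, lengths), where gamma counts the steps at which M(t) holds
    and lengths[l - 1] is the length of the l-th sub-interval, l = 1, ..., gamma + 1."""
    lengths = []
    current = 0
    for flag in m_flags:
        if flag:
            # An occurrence of M(t) closes the current sub-interval (possibly empty).
            lengths.append(current)
            current = 0
        else:
            current += 1
    lengths.append(current)  # sub-interval after the last occurrence
    return len(lengths) - 1, lengths


flags = [False, False, True, False, True, True, False, False, False, True, False]
print(split_interval(flags))  # prints (4, [2, 1, 0, 3, 1])
```

Sub-intervals may be empty when two occurrences are adjacent, and with no occurrence at all the single sub-interval is the whole interval.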
Observe that since a saturated arm i can be played at step t only if $\theta_i(t)$ is greater than $\theta_1(t)$, saturated arm i can
be played at a time step $t \notin I_j(\ell), \forall \ell, j$ (i.e., at a time step t where $M(t)$ holds) only if $\theta_i(t) > \mu_i + \Delta_i/2$. Let us
define event $E(t)$ as

$$E(t) : \{\theta_i(t) \in [\mu_i - \Delta_i/2,\ \mu_i + \Delta_i/2],\ \forall i \in C(t)\}.$$

Then, the number of plays of saturated arms in interval $I_j$ is at most

$$\sum_{\ell=1}^{\gamma_j+1} |I_j(\ell)| + \sum_{t \in I_j} I(\overline{E(t)}).$$

In words, $E(t)$ denotes the event that all saturated arms have $\theta_i(t)$ tightly concentrated around their means. Intuitively,
from the definition of saturated arms, $E(t)$ should hold with high probability; we prove this in Lemma 4.

We are interested in bounding regret due to playing saturated arms, which depends not only on the number of
plays, but also on which saturated arm is played at each time step. Let $V_j^{\ell,a}$ denote the number of steps in $I_j(\ell)$, for
which a is the best saturated arm, i.e.

$$V_j^{\ell,a} = |\{t \in I_j(\ell) : \mu_a = \max_{i \in C(t)} \mu_i\}|, \quad (4)$$

(resolve the ties for best saturated arm using an arbitrary, but fixed, ordering on arms). In Figure 1 we illustrate this
notation by showing steps $\{V_j^{4,a}\}$ for interval $I_j(4)$. In the example shown, we assume that $\mu_1 > \mu_2 > \cdots > \mu_6$, and
that the suboptimal arms got added to the saturated set $C(t)$ in order 5, 3, 4, 2, 6, so that initially 5 is the best saturated
arm, then 3 is the best saturated arm, and finally 2 is the best saturated arm.

Recall that $M(t)$ holds trivially for all t such that $C(t)$ is empty. Therefore, there is at least one saturated arm at
all $t \in I_j(\ell)$, and hence $V_j^{\ell,a}$, $a = 2, \ldots, N$ are well defined and cover the interval $I_j(\ell)$,

$$|I_j(\ell)| = \sum_{a=2}^{N} V_j^{\ell,a}.$$

Next, we will show that the regret due to playing a saturated arm at a time step t in one of the $V_j^{\ell,a}$ steps is at most
$3\Delta_a + I(\overline{E(t)})$. The idea is that if all saturated arms have their $\theta_i(t)$ tightly concentrated around their means $\mu_i$, then
either the arm with the highest mean (i.e., the best saturated arm a) or an arm with mean very close to $\mu_a$ will be
chosen to be played during these $V_j^{\ell,a}$ steps. That is, if a saturated arm i is played at a time t among one of the $V_j^{\ell,a}$
steps, then, either $E(t)$ is violated, i.e. $\theta_{i'}(t)$ for some saturated arm $i'$ is not close to its mean, or

$$\mu_i + \Delta_i/2 \ge \theta_i(t) \ge \theta_a(t) \ge \mu_a - \Delta_a/2,$$

which implies that

$$\Delta_i = \mu_1 - \mu_i \le \mu_1 - \mu_a + \frac{\Delta_a}{2} + \frac{\Delta_i}{2} \ \Rightarrow\ \Delta_i \le 3\Delta_a. \quad (5)$$

Therefore, regret due to play of a saturated arm at a time t in one of the $V_j^{\ell,a}$ steps is at most $3\Delta_a + I(\overline{E(t)})$. With
slight abuse of notation let us use $t \in V_j^{\ell,a}$ to indicate that t is one of the $V_j^{\ell,a}$ steps in $I_j(\ell)$. Then, the expected regret
due to playing saturated arms in interval $I_j$ is bounded as

$$E[R^s(I_j)] \le E\Big[\sum_{\ell=1}^{\gamma_j+1} \sum_{a=2}^{N} \sum_{t \in V_j^{\ell,a}} (3\Delta_a + I(\overline{E(t)}))\Big] + E\Big[\sum_{t \in I_j} I(\overline{E(t)})\Big]$$
$$\le E\Big[\sum_{\ell=1}^{\gamma_j+1} \sum_{a=2}^{N} 3\Delta_a V_j^{\ell,a}\Big] + 2E\Big[\sum_{t \in I_j} I(\overline{E(t)})\Big]. \quad (6)$$



The following lemma will be useful for bounding the second term on the right hand side in the above equation (as shown in the complete proof in Appendix D).



Lemma 4. For all $t$,

\[ \Pr(E(t)) \ge 1 - \frac{4(N-1)}{T^2}. \]

Also, for all $t$, $j$, and $s \le j$,

\[ \Pr(E(t) \mid s(j) = s) \ge 1 - \frac{4(N-1)}{T^2}. \]

Proof. Refer to Appendix C.4. □
The stronger bound given by the second statement of the lemma above will be useful later in bounding the first term on the rhs of (6). For bounding that term, we establish the following lemma.



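Lemma 5. For all $j$,

\[ E\Big[\sum_{\ell=1}^{\gamma_j+1}\sum_{a=2}^{N} V_j^{\ell,a}\Delta_a\Big] \le E\Big[E[(\gamma_j + 1) \mid s(j)] \sum_{a=2}^{N} \Delta_a\, E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big]\Big]. \tag{7} \]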



Proof. The key observation used in proving this lemma is that given a fixed value of $s(j) = s$, the random variable $V_j^{\ell,a}$ is stochastically dominated by random variable $X(j, s, \mu_a + \frac{\Delta_a}{2})$ (defined earlier as a geometric variable denoting the number of trials before an independent sample from Beta$(s+1, j-s+1)$ distribution exceeds $\mu_a + \frac{\Delta_a}{2}$). A technical difficulty in deriving the inequality above is that the random variables $\gamma_j$ and $V_j^{\ell,a}$ are not independent in general (both depend on the values taken by $\{\theta_i(t)\}$ over the interval). This issue is handled through careful conditioning of the random variables on history. The details of the proof are provided in Appendix C.5. □
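To make the geometric variable $X(j, s, y)$ concrete, the sketch below compares its mean, computed from the binomial cdf via Fact 1 as $1/F^B_{j+1,y}(s) - 1$, with a Monte Carlo estimate. The helper names and the parameter values ($j = 10$, $s = 6$, $y = 0.5$) are illustrative assumptions, and the simulation is only a sanity check, not part of the proof.

```python
import math
import random


def binom_cdf(n, p, k):
    """Pr(Binomial(n, p) <= k), summed directly from the pmf."""
    if k < 0:
        return 0.0
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(0, min(k, n) + 1))


def expected_trials(j, s, y):
    """Mean number of Beta(s+1, j-s+1) samples drawn before one exceeds y.

    By Fact 1, Pr(sample > y) = F^B_{j+1,y}(s), so the mean is 1/F - 1.
    """
    return 1.0 / binom_cdf(j + 1, y, s) - 1.0


def simulate_trials(j, s, y, rng):
    """Count draws that fall at or below y before the first draw above y."""
    count = 0
    while rng.betavariate(s + 1, j - s + 1) <= y:
        count += 1
    return count


rng = random.Random(0)
j, s, y = 10, 6, 0.5          # hypothetical values for illustration
runs = 20000
estimate = sum(simulate_trials(j, s, y, rng) for _ in range(runs)) / runs
print(expected_trials(j, s, y), estimate)
```

The two printed numbers should be close; the gap shrinks as `runs` grows.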

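Now using the above lemma the first term in (6) can be bounded by

\[ 3E\Big[\sum_{j=0}^{T-1} E[\gamma_j \mid s(j)] \sum_{a} \Delta_a E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big]\Big] + 3E\Big[\sum_{j=0}^{T-1}\sum_{a} \Delta_a E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big]\Big]. \]

We next show how to bound the first term in this equation; the second term will be dealt with in the complete proof in Appendix D.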

Recall that $\gamma_j$ denotes the number of occurrences of event $M(t)$ in interval $I_j$, i.e. the number of times in interval $I_j$ that $\theta_1(t)$ was greater than $\mu_i + \frac{\Delta_i}{2}$ of all saturated arms $i \in C(t)$, and yet the first arm was not played. The only reasons the first arm would not be played at a time $t$ despite $\theta_1(t) > \max_{i \in C(t)} \mu_i + \frac{\Delta_i}{2}$ are that either $E(t)$ was violated, i.e. some saturated arm whose $\theta_i(t)$ was not close to its mean was played instead; or some unsaturated arm $u$ with highest $\theta_u(t)$ was played. Therefore, the random variables $\gamma_j$ satisfy

\[ \gamma_j \le \sum_{t \in I_j} I(\text{an unsaturated arm is played at time } t) + \sum_{t \in I_j} I(\overline{E(t)}). \]

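Using Lemma 4 and the fact that an unsaturated arm $u$ can be played at most $L_u$ times before it becomes saturated, we obtain that

\[
\begin{aligned}
\sum_{j=0}^{T-1} E[\gamma_j \mid s(j)] &\le \sum_{j=0}^{T-1} E\Big[\sum_{t \in I_j} I(\text{an unsaturated arm is played at time } t) \,\Big|\, s(j)\Big] + \sum_{j=0}^{T-1} E\Big[\sum_{t \in I_j} I(\overline{E(t)}) \,\Big|\, s(j)\Big] \\
&\le \sum_{u=2}^{N} L_u + \sum_{t=1}^{T} \frac{4(N-1)}{T^2} \\
&\le \sum_{u=2}^{N} L_u + 4(N-1).
\end{aligned}
\tag{8}
\]

Note that $\sum_{j=0}^{T-1} E[\gamma_j \mid s(j)]$ is a r.v. (because of random $s(j)$), and the above bound applies for all instantiations of this r.v.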



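Let $y_a = \mu_a + \frac{\Delta_a}{2}$. Then,

\[
\begin{aligned}
& E\Big[\sum_{j=0}^{T-1} E[\gamma_j \mid s(j)] \sum_a \Delta_a E[X(j, s(j), y_a) \mid s(j)]\Big] \\
&\quad \le E\Big[\Big(\sum_{j=0}^{T-1} E[\gamma_j \mid s(j)]\Big)\Big(\max_j \sum_a \Delta_a E[X(j, s(j), y_a) \mid s(j)]\Big)\Big] \\
&\quad \le \Big(\sum_u L_u + 4(N-1)\Big) \sum_a \Delta_a E\Big[\max_j E[X(j, s(j), y_a) \mid s(j)]\Big] \\
&\quad \le \Big(\sum_u L_u + 4(N-1)\Big) \sum_a \Delta_a E\Big[\frac{I(s(j_a^*) \le \lfloor y_a j_a^* \rfloor)}{F^B_{j_a^*+1,\,y_a}(s(j_a^*))} + \frac{I(s(j_a^*) \ge \lceil y_a j_a^* \rceil)}{F^B_{j_a^*+1,\,y_a}(s(j_a^*))}\Big],
\end{aligned}
\tag{9}
\]

where

\[ j_a^* = \arg\max_{j \in \{0,\ldots,T-1\}} E[X(j, s(j), y_a) \mid s(j)] = \arg\max_{j \in \{0,\ldots,T-1\}} \frac{1}{F^B_{j+1,\,y_a}(s(j))}. \]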
Note that $j_a^*$ is a random variable, which is completely determined by the instantiation of random sequence $s(1), s(2), \ldots$ Now, for the first term in the above,

\[
\begin{aligned}
E\Big[\frac{I(s(j_a^*) \le \lfloor y_a j_a^* \rfloor)}{F^B_{j_a^*+1,\,y_a}(s(j_a^*))}\Big]
&\le \sum_j E\Big[\frac{I(s(j) \le \lfloor y_a j \rfloor)}{F^B_{j+1,\,y_a}(s(j))}\Big] \\
&= \sum_j \sum_{s=0}^{\lfloor y_a j \rfloor} \frac{f^B_{j,\mu_1}(s)}{F^B_{j+1,\,y_a}(s)} \\
&\le \sum_j \frac{\mu_1}{\Delta'_a} e^{-D_a j} \le \frac{16}{\Delta_a^3},
\end{aligned}
\]

where $\Delta'_a = \mu_1 - y_a = \Delta_a/2$, and $D_a$ is the KL-divergence between Bernoulli distributions with parameters $\mu_1$ and $y_a$. The penultimate inequality follows using (15) in the proof of Lemma 3 in Appendix C.2 with $\Delta' = \Delta'_a$, and $D = D_a$. The resulting bound on the first term in (9) is $O\big((\sum_u L_u)\sum_a \frac{1}{\Delta_a^2}\big) = O\big((\sum_a \frac{1}{\Delta_a^2})^2 \ln T\big)$, which forms the dominating term in our regret bound. The bounds on the remaining terms and further details of the proof for regret due to saturated arms are provided in Appendix D. Since an unsaturated arm $u$ becomes saturated after $L_u$ plays, regret due to unsaturated arms is at most $\sum_{u=2}^{N} L_u \Delta_u = 24(\ln T)\big(\sum_{u=2}^{N} \frac{1}{\Delta_u}\big)$. Summing the regret due to saturated and unsaturated arms, we obtain the result of Theorem 2.
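As a small numeric illustration of the unsaturated-arm term, the sketch below evaluates $\sum_u L_u \Delta_u$ assuming the saturation threshold $L_u = 24(\ln T)/\Delta_u^2$ implied by the identity above. The gap values and horizon are hypothetical, and the function names are illustrative.

```python
import math


def saturation_threshold(gap, horizon):
    """L_u = 24 ln(T) / gap^2: plays after which a suboptimal arm is saturated."""
    return 24.0 * math.log(horizon) / gap**2


def unsaturated_regret_bound(gaps, horizon):
    """Sum of L_u * Delta_u over the suboptimal arms."""
    return sum(saturation_threshold(g, horizon) * g for g in gaps)


gaps = [0.1, 0.2, 0.5]   # hypothetical gaps Delta_2, Delta_3, Delta_4
horizon = 10_000         # hypothetical time horizon T
print(unsaturated_regret_bound(gaps, horizon))
```

Each arm contributes $24(\ln T)/\Delta_u$, so arms with small gaps dominate this term even though each play of them costs little.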



Conclusion. In this paper, we showed theoretical guarantees for Thompson Sampling close to other state of the art methods, like UCB. Our result is a first step in theoretical understanding of TS and there are several avenues to explore for future work. There is a gap between our upper bounds and the lower bound of [9]. While it may be easy to improve the constant factors in our upper bounds by making the analysis more careful (but more complicated), it seems harder to improve the dependence on the $\Delta$'s. With further work, we hope that our techniques in this paper will be useful in providing several extensions, including analysis of TS for delayed and batched feedback, contextual bandits, prior mismatch and posterior reshaping discussed in [2]. As mentioned before, empirically TS has been shown to perform better than other methods, especially for handling delayed feedback. A theoretical justification of this observation would require a tighter analysis of TS than what we have achieved here, and in addition, it would require lower bounds on the regret of the other algorithms. TS has also been used for problems such as regularized logistic regression (see [2]). These multi-parameter settings lack theoretical analysis.
A Multiple optimal arms

Consider the $N$-armed bandit problem with $\mu^* = \max_i \mu_i$. We will show that adding another arm with expected reward $\mu^*$ can only decrease the expected regret of TS algorithm. Suppose that we added arm $N+1$ with expected reward $\mu^*$. Consider the expected regret for the new bandit in time $T$, conditioned on the exact time steps among $1, \ldots, T$ on which arm $N+1$ is played by the algorithm. Since the arm $N+1$ has expected reward $\mu^*$, there is no regret in these time steps. Now observe that in the remaining time steps, the algorithm behaves exactly as it would for the original bandit with $N$ arms. Therefore, given that the $(N+1)$-th arm is played $x$ times, the expected regret in time $T$ for the new bandit will be the same as the expected regret in time $T - x$ for the original bandit. Let $\mathcal{R}^N(T)$ and $\mathcal{R}^{N+1}(T)$ denote the expected regret in time $T$ for the original and new bandit, respectively. Then,

\[
\begin{aligned}
E[\mathcal{R}^{N+1}(T)] &= E\big[E[\mathcal{R}^{N+1}(T) \mid k_{N+1}(T)]\big] = E\big[E[\mathcal{R}^{N}(T - k_{N+1}(T)) \mid k_{N+1}(T)]\big] \\
&\le E\big[E[\mathcal{R}^{N}(T) \mid k_{N+1}(T)]\big] = E[\mathcal{R}^{N}(T)].
\end{aligned}
\]

This argument shows that the expected regret of Thompson Sampling for the $N$-armed bandit problem with $r$ optimal arms is bounded by the expected regret of Thompson Sampling for the $(N - r + 1)$-armed bandit problem obtained on removing (any) $r - 1$ of the optimal arms.
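The claim can be explored empirically with a small simulation. The sketch below is a minimal Bernoulli Thompson Sampling loop with Beta(1, 1) priors; the function name, the arm means, the horizon and the number of repetitions are all assumptions chosen for illustration, and a simulation of this kind only illustrates the statement, it does not prove it.

```python
import random


def thompson_sampling(means, horizon, rng):
    """Run Bernoulli Thompson Sampling; return (pseudo-regret, play counts)."""
    n = len(means)
    successes = [0] * n
    failures = [0] * n
    plays = [0] * n
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Sample theta_i from the Beta(S_i + 1, F_i + 1) posterior of each arm.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        plays[arm] += 1
        regret += best - means[arm]
    return regret, plays


def average_regret(means, horizon, repetitions, seed):
    rng = random.Random(seed)
    total = sum(thompson_sampling(means, horizon, rng)[0]
                for _ in range(repetitions))
    return total / repetitions


original = [0.7, 0.5, 0.4]            # hypothetical 3-armed instance
duplicated = [0.7, 0.7, 0.5, 0.4]     # same instance with a second optimal arm
print(average_regret(original, 500, 50, seed=1),
      average_regret(duplicated, 500, 50, seed=1))
```

With enough repetitions one would expect the second average not to exceed the first by more than simulation noise, in line with the argument above.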



B Facts used in the analysis 

Fact 1.

\[ F^{beta}_{\alpha,\beta}(y) = 1 - F^B_{\alpha+\beta-1,\,y}(\alpha - 1), \]

for all positive integers $\alpha$, $\beta$.

Proof. This fact is well-known (it's mentioned on Wikipedia) but we are not aware of a specific reference. Since the proof is easy and short we will present a proof here. The Wikipedia page also mentions that it can be proved using integration by parts. Here we provide a direct combinatorial proof which may be new.

One well-known way to generate a r.v. with cdf $F^{beta}_{\alpha,\beta}$ for integer $\alpha$ and $\beta$ is the following: generate uniform in $[0, 1]$ r.v.s $X_1, X_2, \ldots, X_{\alpha+\beta-1}$ independently. Let the values of these r.v.s in sorted increasing order be denoted $X_1^{\uparrow}, X_2^{\uparrow}, \ldots, X_{\alpha+\beta-1}^{\uparrow}$. Then $X_\alpha^{\uparrow}$ has cdf $F^{beta}_{\alpha,\beta}$. Thus $F^{beta}_{\alpha,\beta}(y)$ is the probability that $X_\alpha^{\uparrow} \le y$.

We now reinterpret this probability using the binomial distribution: The event $X_\alpha^{\uparrow} \le y$ happens iff for at least $\alpha$ of the $X_1, \ldots, X_{\alpha+\beta-1}$ we have $X_i \le y$. For each $X_i$ we have $\Pr[X_i \le y] = y$; thus the probability that for at most $\alpha - 1$ of the $X_i$'s we have $X_i \le y$ is $F^B_{\alpha+\beta-1,\,y}(\alpha - 1)$. And so the probability that for at least $\alpha$ of the $X_i$'s we have $X_i \le y$ is $1 - F^B_{\alpha+\beta-1,\,y}(\alpha - 1)$. □
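The order-statistics construction used in this proof is easy to check numerically. The sketch below evaluates the right-hand side of Fact 1 from the binomial cdf and compares it with a Monte Carlo estimate of $\Pr(X_\alpha^{\uparrow} \le y)$; the function names and the test values $\alpha = 3$, $\beta = 4$, $y = 0.4$ are illustrative assumptions.

```python
import math
import random


def binom_cdf(n, p, k):
    """Pr(Binomial(n, p) <= k), summed directly from the pmf."""
    if k < 0:
        return 0.0
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(0, min(k, n) + 1))


def beta_cdf_via_fact1(alpha, beta, y):
    """Beta(alpha, beta) cdf at y for positive integers alpha, beta (Fact 1)."""
    return 1.0 - binom_cdf(alpha + beta - 1, y, alpha - 1)


def beta_cdf_by_order_statistics(alpha, beta, y, runs, rng):
    """Estimate Pr(alpha-th smallest of alpha+beta-1 uniforms <= y)."""
    hits = 0
    for _ in range(runs):
        draws = sorted(rng.random() for _ in range(alpha + beta - 1))
        if draws[alpha - 1] <= y:
            hits += 1
    return hits / runs


rng = random.Random(0)
exact = beta_cdf_via_fact1(3, 4, 0.4)
approx = beta_cdf_by_order_statistics(3, 4, 0.4, 20000, rng)
print(exact, approx)
```

For $\alpha = \beta = 1$ the formula reduces to the uniform cdf $y$, and for $\alpha = 2$, $\beta = 1$ it gives $y^2$, which are quick special cases to verify by hand.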
The median of an integer-valued random variable $X$ is an integer $m$ such that $\Pr(X \le m) \ge 1/2$ and $\Pr(X \ge m) \ge 1/2$. The following fact says that the median of the binomial distribution is close to its mean.

Fact 2 (121). Median of the binomial distribution Binomial$(n, p)$ is either $\lfloor np \rfloor$ or $\lceil np \rceil$.
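Fact 2 can be checked exhaustively on a small grid using exact rational arithmetic. The sketch below tests the median definition given above for $m \in \{\lfloor np \rfloor, \lceil np \rceil\}$; the helper names and the grid of $(n, p)$ values are illustrative choices.

```python
import math
from fractions import Fraction


def binom_pmf(n, p, k):
    """Exact Pr(Binomial(n, p) = k) when p is a Fraction."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


def is_median(n, p, m):
    """Check Pr(X <= m) >= 1/2 and Pr(X >= m) >= 1/2 for X ~ Binomial(n, p)."""
    at_most = sum(binom_pmf(n, p, k) for k in range(0, m + 1))
    at_least = sum(binom_pmf(n, p, k) for k in range(m, n + 1))
    return at_most >= Fraction(1, 2) and at_least >= Fraction(1, 2)


def median_is_floor_or_ceil(n, p):
    """True if floor(np) or ceil(np) is a median of Binomial(n, p)."""
    mean = n * p
    return any(is_median(n, p, m)
               for m in {math.floor(mean), math.ceil(mean)})


all_ok = all(median_is_floor_or_ceil(n, Fraction(k, 20))
             for n in range(1, 13) for k in range(0, 21))
print(all_ok)
```

Using `Fraction` avoids floating-point ties at exactly $1/2$, which do occur (for example $n = 1$, $p = 1/2$).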

Fact 3 (Chernoff-Hoeffding bounds). Let $X_1, \ldots, X_n$ be random variables with common range $[0, 1]$ and such that $E[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_n = X_1 + \ldots + X_n$. Then for all $a \ge 0$,

\[ \Pr(S_n \ge n\mu + a) \le e^{-2a^2/n}, \]

\[ \Pr(S_n \le n\mu - a) \le e^{-2a^2/n}. \]
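For i.i.d. Bernoulli variables the upper-tail bound in Fact 3 can be compared against the exact binomial tail. The sketch below does this on a small grid; the function names and grid values are illustrative assumptions, and a small tolerance is added to absorb floating-point error.

```python
import math


def binom_upper_tail(n, p, k):
    """Pr(Binomial(n, p) >= k), summed directly from the pmf."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(max(k, 0), n + 1))


def hoeffding_bound(n, a):
    """Right-hand side exp(-2 a^2 / n) of the Chernoff-Hoeffding bound."""
    return math.exp(-2.0 * a * a / n)


def bound_holds(n, p, a):
    """Check Pr(S_n >= n*p + a) <= exp(-2 a^2 / n) for S_n ~ Binomial(n, p)."""
    k = math.ceil(n * p + a)  # S_n is an integer, so S_n >= n*p + a iff S_n >= k
    return binom_upper_tail(n, p, k) <= hoeffding_bound(n, a) + 1e-12


all_hold = all(bound_holds(n, p, a)
               for n in range(1, 31)
               for p in (0.1, 0.3, 0.5, 0.7, 0.9)
               for a in (0.5, 1.0, 2.0, 3.0, 5.0))
print(all_hold)
```

The lower-tail bound follows by the same check applied to $n - S_n$, which is Binomial$(n, 1-p)$.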

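Lemma 6. For all $n$, $p \in [0, 1]$, $\delta \ge 0$,

\[ F^B_{n,p}(np - n\delta) \le e^{-2n\delta^2}, \qquad 1 - F^B_{n,p}(np + n\delta) \le e^{-2n\delta^2}, \tag{10} \]

\[ 1 - F^B_{n+1,p}(np + n\delta) \le \frac{e^{4\delta}}{e^{2n\delta^2}}. \tag{11} \]

Proof. The first result is a simple application of Chernoff-Hoeffding bounds from Fact 3. For the second result, we observe that

\[ F^B_{n+1,p}(np + n\delta) = (1 - p)F^B_{n,p}(np + n\delta) + pF^B_{n,p}(np + n\delta - 1) \ge F^B_{n,p}(np + n\delta - 1). \]

By Chernoff-Hoeffding bounds,

\[ 1 - F^B_{n,p}(np + n\delta - 1) \le e^{-2n(\delta - 1/n)^2} \le \frac{e^{4\delta}}{e^{2n\delta^2}}. \]

□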
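The one-step recurrence for the binomial cdf used in this proof, together with the sandwich $F^B_{n,p}(r-1) \le F^B_{n+1,p}(r) \le F^B_{n,p}(r)$ that follows from it, can be verified exactly with rational arithmetic. The helper names and the grid below are illustrative assumptions.

```python
import math
from fractions import Fraction


def binom_cdf(n, p, r):
    """Exact Pr(Binomial(n, p) <= r) for integer r (zero when r < 0)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(0, min(r, n) + 1))


def recurrence_holds(n, p, r):
    """Check F_{n+1,p}(r) = (1-p) F_{n,p}(r) + p F_{n,p}(r-1) and the sandwich."""
    lhs = binom_cdf(n + 1, p, r)
    rhs = (1 - p) * binom_cdf(n, p, r) + p * binom_cdf(n, p, r - 1)
    sandwiched = binom_cdf(n, p, r - 1) <= lhs <= binom_cdf(n, p, r)
    return lhs == rhs and sandwiched


all_ok = all(recurrence_holds(n, Fraction(k, 10), r)
             for n in range(0, 8)
             for k in range(0, 11)
             for r in range(-1, n + 2))
print(all_ok)
```

The recurrence simply conditions on the outcome of the $(n+1)$-th trial: with probability $1 - p$ it fails and at most $r$ successes are needed from the first $n$ trials, otherwise at most $r - 1$.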



C Proofs of Lemmas 
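C.1 Proof of Lemma 2

Proof. In this lemma, we lower bound the probability of $E_2(t)$ by $1 - \frac{2}{T^2}$. Recall that event $E_2(t)$ holds if the following is true:

\[ \{\theta_2(t) \le \mu_2 + \tfrac{\Delta}{2}\} \text{ or } \{k_2(t) < L\}. \]

Also define $A(t)$ as the event

\[ A(t): \frac{S_2(t)}{k_2(t)} \le \mu_2 + \frac{\Delta}{4}, \]

where $S_2(t)$, $k_2(t)$ denote the number of successes and number of plays respectively of the second arm until time $t - 1$. We will upper bound the probability $\Pr(\overline{E_2(t)}) = 1 - \Pr(E_2(t))$ as:

\[
\begin{aligned}
\Pr(\overline{E_2(t)}) &= \Pr(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2},\, k_2(t) \ge L) \\
&\le \Pr(\overline{A(t)},\, k_2(t) \ge L) + \Pr(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2},\, k_2(t) \ge L,\, A(t)).
\end{aligned}
\tag{12}
\]

For clarity of exposition, let us define another random variable $\bar{Z}_{2,M}$, as the average number of successes over the first $M$ plays of the second arm. More precisely, let random variable $Z_{2,m}$ denote the output of the $m$-th play of the second arm. Then,

\[ \bar{Z}_{2,M} = \frac{1}{M} \sum_{m=1}^{M} Z_{2,m}. \]

Note that by definition, $\bar{Z}_{2,k_2(t)} = \frac{S_2(t)}{k_2(t)}$. Also, $\bar{Z}_{2,M}$ is the average of $M$ iid Bernoulli variables, each with mean $\mu_2$.

Now, for all $t$,

\[
\begin{aligned}
\Pr(\overline{A(t)},\, k_2(t) \ge L) &= \sum_{\ell=L}^{T} \Pr(\bar{Z}_{2,k_2(t)} > \mu_2 + \tfrac{\Delta}{4},\, k_2(t) = \ell) \\
&= \sum_{\ell=L}^{T} \Pr(\bar{Z}_{2,\ell} > \mu_2 + \tfrac{\Delta}{4},\, k_2(t) = \ell) \\
&\le \sum_{\ell=L}^{T} \Pr(\bar{Z}_{2,\ell} > \mu_2 + \tfrac{\Delta}{4}) \\
&\le \sum_{\ell=L}^{T} e^{-2\ell\Delta^2/16} \\
&\le \frac{1}{T^2}.
\end{aligned}
\]

The second last inequality is by applying Chernoff bounds, since $\bar{Z}_{2,\ell}$ is simply the average of $\ell$ iid Bernoulli variables each with mean $\mu_2$.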

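We will derive the bound on second probability term in (12) in a similar manner. It will be useful to define $W(\ell, z)$ as a random variable distributed as Beta$(\ell z + 1, \ell - \ell z + 1)$. Note that if at time $t$, the number of plays of second arm is $k_2(t) = \ell$, then $\theta_2(t)$ is distributed as Beta$(\ell \bar{Z}_{2,\ell} + 1, \ell - \ell \bar{Z}_{2,\ell} + 1)$, i.e. same as $W(\ell, \bar{Z}_{2,\ell})$.

\[
\begin{aligned}
\Pr(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2},\, A(t),\, k_2(t) \ge L) &= \sum_{\ell=L}^{T} \Pr(\theta_2(t) > \mu_2 + \tfrac{\Delta}{2},\, A(t),\, k_2(t) = \ell) \\
&\le \sum_{\ell=L}^{T} \Pr\Big(\theta_2(t) > \frac{S_2(t)}{k_2(t)} + \frac{\Delta}{4},\, k_2(t) = \ell\Big) \\
&= \sum_{\ell=L}^{T} \Pr(W(\ell, \bar{Z}_{2,\ell}) > \bar{Z}_{2,\ell} + \tfrac{\Delta}{4},\, k_2(t) = \ell) \\
&\le \sum_{\ell=L}^{T} \Pr(W(\ell, \bar{Z}_{2,\ell}) > \bar{Z}_{2,\ell} + \tfrac{\Delta}{4}) \\
\text{(using Fact 1)} \quad &= \sum_{\ell=L}^{T} E\Big[F^B_{\ell+1,\,\bar{Z}_{2,\ell} + \frac{\Delta}{4}}(\ell \bar{Z}_{2,\ell})\Big] \\
&\le \sum_{\ell=L}^{T} E\Big[F^B_{\ell,\,\bar{Z}_{2,\ell} + \frac{\Delta}{4}}(\ell \bar{Z}_{2,\ell})\Big] \\
&\le \sum_{\ell=L}^{T} e^{-2\Delta^2\ell^2/(16\ell)} \\
&\le T e^{-2L\Delta^2/16} = \frac{1}{T^2}.
\end{aligned}
\]

The third-last inequality follows from the observation that

\[ F^B_{n+1,p}(r) = (1 - p)F^B_{n,p}(r) + pF^B_{n,p}(r - 1) \le (1 - p)F^B_{n,p}(r) + pF^B_{n,p}(r) = F^B_{n,p}(r). \]

And, the second-last inequality follows from Chernoff-Hoeffding bounds (refer to Fact 3 and Lemma 6). □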

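C.2 Proof of Lemma 3

Proof. Using Lemma 1, the expected value of $X(j, s(j), y)$ for any given $s(j)$ is

\[ E[X(j, s(j), y) \mid s(j)] = \frac{1}{F^B_{j+1,y}(s(j))} - 1. \]

Case of large j: First, we consider the case of large $j$, i.e. when $j \ge 4(\ln T)/\Delta'^2$. Then, by simple application of Chernoff-Hoeffding bounds (refer to Fact 3 and Lemma 6), we can derive that for any $s \ge (y + \frac{\Delta'}{2})j$,

\[ F^B_{j+1,y}(s) \ge F^B_{j+1,y}\big(yj + \tfrac{\Delta' j}{2}\big) \ge 1 - \frac{e^{4\Delta'/2}}{e^{2j\Delta'^2/4}} \ge 1 - \frac{e^{2\Delta'}}{T^2} \ge 1 - \frac{8}{T^2}, \]

giving that for $s \ge (y + \frac{\Delta'}{2})j$, $E[X(j, s, y)] \le \frac{1}{1 - 8/T^2} - 1$.

Again using Chernoff-Hoeffding bounds, the probability that $s(j)$ takes values smaller than $(y + \frac{\Delta'}{2})j$ can be bounded as

\[ F^B_{j,\mu_1}\big(yj + \tfrac{\Delta' j}{2}\big) = F^B_{j,\mu_1}\big(\mu_1 j - \tfrac{\Delta' j}{2}\big) \le e^{-2j\Delta'^2/4} \le \frac{1}{T^2} < \frac{8}{T^2}. \]

For these values of $s(j)$, we will use the upper bound of $T$. Thus,

\[ E\big[\min\{E[X(j, s(j), y) \mid s(j)], T\}\big] \le \Big(1 - \frac{8}{T^2}\Big)\Big(\frac{1}{1 - 8/T^2} - 1\Big) + \frac{8}{T^2} \cdot T \le \frac{16}{T}. \]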

Case of small j: For small $j$, the argument is more delicate. We use,

\[ E\big[E[X(j, s(j), y) \mid s(j)]\big] = \sum_{s=0}^{j} \frac{f^B_{j,\mu_1}(s)}{F^B_{j+1,y}(s)} - 1, \tag{13} \]

where $f^B_{j,\mu_1}$ denotes pdf of the Binomial$(j, \mu_1)$ distribution. We use the observation that for $s \ge \lceil y(j + 1) \rceil$, $F^B_{j+1,y}(s) \ge 1/2$. This is because the median of a Binomial$(n, p)$ distribution is either $\lfloor np \rfloor$ or $\lceil np \rceil$ (see Fact 2). Therefore,

^ (s) 

For smalls, i.e., s < [yjj, we use Ff+i,,Gs) = {l-y)F^^y{s)+yF,Js-l) > (1 and F^^(s) > 
to get 

'^JM± < V 1 /^.(-) 

^ (l-y) y^il-yy-- 



s=0 



< 



il-y)\ R-l ; (1 - 2/)J" 
il-y)R-l yy^{l-y) 



Ail - y A 

If lyj\ < [yil < \y{j + 1)1' f^hen we need to additionally consider s = \yj~\ . Note, however, that in this case 
\yj^ < yj + y- For s = lyjl , 



< 



2 

< ^ . (16) 

1-y 



15 



Alternatively, we can use the following bound for s = \yj~\ , 



< 



< 



< 



< 



< 



1 fLi^) 

(i-y) 

1 f^.M 



1 



(i-y) 
1 

Ry 
0^' 



R 



1 - Ml 



1-2/ 



(because s = [yj] < yj + y) 



(17) 



Next, we substitute the bounds from (14)-(17) in Equation (13) to get the result in the lemma. In this substitution, for $s = \lceil yj \rceil$, we use the bound in Equation (16) when $j < \frac{y}{D} \ln R$, and the bound in Equation (17) when $j \ge \frac{y}{D} \ln R$.

□ 



C.3 Details of Equation ^ 

Using Lemma 3 for $y = \mu_2 + \Delta/2$, and $\Delta' = \Delta/2$, we can bound the expected number of plays of the second arm as:



E[fc2(T)] 



i + E 



T-l 
3=30 



T-l 

< L+ ^ E 

3=0 



l{E 



T} 



^Pr(£;2(i))-r 



< i 



= L 



< L 



(*) 

< L 



< L 



L 



4(l„T)/A'^-l 



4(lnT)/A'^-l 



A'2 



E 

J=0 



E 



4(1„T)/A'=-1 

A'2 A' 



41nT/A'2--^ lnfl-1 



J=0 



Ini? • 

D (1-2/) 



E 

J=0 



1-2/ 

1 



T + 2 



1-2/ 



18 



n A' ^ 



A'2 13 
41nr D + 1 2 



A' ^ A' 



A'2 A'D A' A' (min{D, 1}) 
41nr 

- 18 
18 



18 



< 



401nr 48 



2 


1 


4 


A^ + 


A^ ^ 


^ A^ 


8 


16 


32 




" A4 " 


A3 


+ 18. 







A2 A4 

The step marked (*) is obtained using following derivations. 



1 D 1 Aii(l-2/ 1 /^i , 1 1^2/ / , y fr, ^ y\^^ 
ylni? = 2/ In— r = yln hyln- < ^i + [D -yln—) < 1 . 

2/(1-/^1 2/ (1-Mi 1-2/ Ml 1-2/ 



y . s D + 1 



A' 
And, since $D \ge 0$ (Gibbs' inequality),

\[ \sum_{j \ge 0} \frac{1}{e^{Dj}} = \frac{1}{1 - e^{-D}} \le \max\Big\{\frac{2}{D}, \frac{e}{e - 1}\Big\} \le \frac{2}{\min\{D, 1\}}. \]

And, (**) uses Pinsker's inequality to obtain $D \ge 2\Delta'^2$.
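Both inequalities used here are easy to spot-check numerically. The sketch below evaluates the Bernoulli KL-divergence against the Pinsker lower bound $2(p - q)^2$, and the geometric sum $1/(1 - e^{-D})$ against $2/\min\{D, 1\}$; the function names and grids are illustrative assumptions, and the small tolerances only absorb floating-point error.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence (natural log) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def geometric_sum(d):
    """Closed form of sum_{j >= 0} exp(-d * j) for d > 0."""
    return 1.0 / (1.0 - math.exp(-d))


grid = [k / 20 for k in range(1, 20)]
pinsker_ok = all(kl_bernoulli(p, q) >= 2 * (p - q) ** 2 - 1e-12
                 for p in grid for q in grid)
sum_ok = all(geometric_sum(d) <= 2 / min(d, 1.0) + 1e-12
             for d in (0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0))
print(pinsker_ok, sum_ok)
```

The Pinsker bound does not depend on the order of the two parameters, so the check applies whichever of the two distributions is taken as the reference.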



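C.4 Proof of Lemma 4

Proof. The proof of this lemma follows along similar lines as the proof of Lemma 2 in Appendix C.1 for the two arms case. We will prove the second statement; the first statement will follow as a corollary.

To prove the second statement of this lemma, we are required to lower bound the probability $\Pr(E(t) \mid s(j) = s)$ for all $t$, $j$, $s \le j$, by $1 - \frac{4(N-1)}{T^2}$, where $s(j)$ denotes the number of successes in first $j$ plays of the first arm. Recall that event $E(t)$ holds if the following is true:

\[ \{\forall i \in C(t),\; \theta_i(t) \in [\mu_i - \tfrac{\Delta_i}{2},\, \mu_i + \tfrac{\Delta_i}{2}]\}. \]

Let us define $E_i^+(t)$ as the event $\{\theta_i(t) \le \mu_i + \frac{\Delta_i}{2} \text{ or } i \notin C(t)\}$, and $E_i^-(t)$ as the event $\{\theta_i(t) \ge \mu_i - \frac{\Delta_i}{2} \text{ or } i \notin C(t)\}$.

Then, we can bound $\Pr(\overline{E(t)} \mid s(j))$ as

\[ \Pr(\overline{E(t)} \mid s(j)) \le \sum_{i=2}^{N} \Pr(\overline{E_i^+(t)} \mid s(j)) + \Pr(\overline{E_i^-(t)} \mid s(j)). \]

Now, observe that

\[ \Pr(\overline{E_i^+(t)} \mid s(j)) = \Pr(\theta_i(t) > \mu_i + \tfrac{\Delta_i}{2},\, k_i(t) \ge L_i \mid s(j)), \]

where $k_i(t)$ is the number of plays of arm $i$ until time $t - 1$. As in the case of two arms, define $A_i(t)$ as the event

\[ A_i(t): \frac{S_i(t)}{k_i(t)} \le \mu_i + \frac{\Delta_i}{4}, \]

where $S_i(t)$, $k_i(t)$ denote the number of successes and number of plays respectively of the $i$-th arm until time $t - 1$. We will upper bound the probability $\Pr(\overline{E_i^+(t)} \mid s(j))$ for all $t$, $j$, $i \ne 1$, using,

\[
\begin{aligned}
\Pr(\overline{E_i^+(t)} \mid s(j)) &= \Pr(\theta_i(t) > \mu_i + \tfrac{\Delta_i}{2},\, k_i(t) \ge L_i \mid s(j)) \\
&\le \Pr(\overline{A_i(t)},\, k_i(t) \ge L_i \mid s(j)) + \Pr(\theta_i(t) > \mu_i + \tfrac{\Delta_i}{2},\, k_i(t) \ge L_i,\, A_i(t) \mid s(j)).
\end{aligned}
\tag{18}
\]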



For clarity of exposition, similar to the two arms case, for every $i = 1, \ldots, N$ we define variables $\{Z_{i,m}\}$, and $\bar{Z}_{i,M}$. $Z_{i,m}$ denotes the output of the $m$-th play of the $i$-th arm. And,

\[ \bar{Z}_{i,M} = \frac{1}{M} \sum_{m=1}^{M} Z_{i,m}. \]

Note that for all $i$, $m$, $Z_{i,m}$ is a Bernoulli variable with mean $\mu_i$, and all $Z_{i,m}$, $i = 1, \ldots, N$, $m = 1, \ldots, T$ are independent of each other.
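Now, instead of bounding the first term $\Pr(\overline{A_i(t)}, k_i(t) \ge L_i \mid s(j))$, we prove a bound on $\Pr(\overline{A_i(t)}, k_i(t) \ge L_i \mid Z_{1,1}, \ldots, Z_{1,j})$. Note that the latter bound is stronger, since $s(j)$ is simply $\sum_{m=1}^{j} Z_{1,m}$. Now, for all $t$, $i \ne 1$,

\[
\begin{aligned}
\Pr(\overline{A_i(t)},\, k_i(t) \ge L_i \mid Z_{1,1}, \ldots, Z_{1,j}) &= \sum_{\ell=L_i}^{T} \Pr(\bar{Z}_{i,k_i(t)} > \mu_i + \tfrac{\Delta_i}{4},\, k_i(t) = \ell \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&= \sum_{\ell=L_i}^{T} \Pr(\bar{Z}_{i,\ell} > \mu_i + \tfrac{\Delta_i}{4},\, k_i(t) = \ell \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&\le \sum_{\ell=L_i}^{T} \Pr(\bar{Z}_{i,\ell} > \mu_i + \tfrac{\Delta_i}{4} \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&= \sum_{\ell=L_i}^{T} \Pr(\bar{Z}_{i,\ell} > \mu_i + \tfrac{\Delta_i}{4}) \\
&\le \sum_{\ell=L_i}^{T} e^{-2\ell\Delta_i^2/16} \\
&\le \frac{1}{T^2}.
\end{aligned}
\]

The third last equality holds because for all $i$, $i'$, $m$, $m'$, $Z_{i,m}$ and $Z_{i',m'}$ are independent of each other, which means $\bar{Z}_{i,\ell}$ is independent of $Z_{1,m}$ for all $m = 1, \ldots, j$. The second last inequality is by applying Chernoff bounds, since $\bar{Z}_{i,\ell}$ is simply the average of $\ell$ iid Bernoulli variables each with mean $\mu_i$.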
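The final step of this Chernoff argument can be spelled out. The short derivation below is a sketch under the assumption that the saturation threshold is $L_i = 24(\ln T)/\Delta_i^2$, which is the value implied by the identity $\sum_u L_u \Delta_u = 24(\ln T)\sum_u 1/\Delta_u$ used in the main body.

```latex
\begin{align*}
\ell \ge L_i = \frac{24\ln T}{\Delta_i^2}
  &\;\Longrightarrow\;
  e^{-2\ell\Delta_i^2/16} \le e^{-2L_i\Delta_i^2/16} = e^{-3\ln T} = \frac{1}{T^3},\\
\sum_{\ell=L_i}^{T} e^{-2\ell\Delta_i^2/16}
  &\le T\cdot\frac{1}{T^3} = \frac{1}{T^2}.
\end{align*}
```

The constant 24 is what makes each term at most $T^{-3}$, so that the sum over at most $T$ values of $\ell$ is still at most $T^{-2}$.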



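We will derive the bound on second probability term in (18) in a similar manner. As before, it will be useful to define $W(\ell, z)$ as a random variable distributed as Beta$(\ell z + 1, \ell - \ell z + 1)$. Note that if at time $t$, the number of plays of arm $i$ is $k_i(t) = \ell$, then $\theta_i(t)$ is distributed as Beta$(\ell \bar{Z}_{i,\ell} + 1, \ell - \ell \bar{Z}_{i,\ell} + 1)$, i.e. same as $W(\ell, \bar{Z}_{i,\ell})$. Now, for the second probability term in (18),

\[
\begin{aligned}
&\Pr(\theta_i(t) > \mu_i + \tfrac{\Delta_i}{2},\, A_i(t),\, k_i(t) \ge L_i \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&\quad = \sum_{\ell=L_i}^{T} \Pr(\theta_i(t) > \mu_i + \tfrac{\Delta_i}{2},\, A_i(t),\, k_i(t) = \ell \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&\quad \le \sum_{\ell=L_i}^{T} \Pr\Big(\theta_i(t) > \frac{S_i(t)}{k_i(t)} + \frac{\Delta_i}{4},\, k_i(t) = \ell \,\Big|\, Z_{1,1}, \ldots, Z_{1,j}\Big) \\
&\quad = \sum_{\ell=L_i}^{T} \Pr(W(\ell, \bar{Z}_{i,\ell}) > \bar{Z}_{i,\ell} + \tfrac{\Delta_i}{4},\, k_i(t) = \ell \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&\quad \le \sum_{\ell=L_i}^{T} \Pr(W(\ell, \bar{Z}_{i,\ell}) > \bar{Z}_{i,\ell} + \tfrac{\Delta_i}{4} \mid Z_{1,1}, \ldots, Z_{1,j}) \\
&\quad = \sum_{\ell=L_i}^{T} \Pr(W(\ell, \bar{Z}_{i,\ell}) > \bar{Z}_{i,\ell} + \tfrac{\Delta_i}{4}) \\
&\quad \text{(using Fact 1)} = \sum_{\ell=L_i}^{T} E\Big[F^B_{\ell+1,\,\bar{Z}_{i,\ell} + \frac{\Delta_i}{4}}(\ell \bar{Z}_{i,\ell})\Big] \\
&\quad \le \sum_{\ell=L_i}^{T} E\Big[F^B_{\ell,\,\bar{Z}_{i,\ell} + \frac{\Delta_i}{4}}(\ell \bar{Z}_{i,\ell})\Big] \\
&\quad \le \sum_{\ell=L_i}^{T} e^{-2\Delta_i^2\ell^2/(16\ell)} \\
&\quad \le \frac{1}{T^2}.
\end{aligned}
\]

Here, we used the observation that for all $i$, $i'$, $m$, $m'$, $Z_{i,m}$ and $Z_{i',m'}$ are independent of each other, which means $\bar{Z}_{i,\ell}$ and $W(\ell, \bar{Z}_{i,\ell})$ are independent of $Z_{1,m}$ for all $m = 1, \ldots, j$. The third-last inequality follows from the observation that

\[ F^B_{n+1,p}(r) = (1 - p)F^B_{n,p}(r) + pF^B_{n,p}(r - 1) \le (1 - p)F^B_{n,p}(r) + pF^B_{n,p}(r) = F^B_{n,p}(r). \]

And, the second-last inequality follows from Chernoff-Hoeffding bounds (refer to Fact 3 and Lemma 6). Substituting the above in Equation (18), we get

\[ \Pr(\overline{E_i^+(t)} \mid s(j)) \le \frac{2}{T^2}. \]

Similarly, we can obtain

\[ \Pr(\overline{E_i^-(t)} \mid s(j)) \le \frac{2}{T^2}. \]

Summing over $i = 2, \ldots, N$, we get

\[ \Pr(\overline{E(t)} \mid s(j)) \le \frac{4(N-1)}{T^2}, \]

which implies the second statement of the lemma. The first statement is a simple corollary of this. □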



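C.5 Proof of Lemma 5

Proof.

\[ E\Big[\sum_{\ell=1}^{\gamma_j+1} V_j^{\ell,a} \,\Big|\, s(j)\Big] = E\Big[\sum_{\ell=1}^{T} V_j^{\ell,a} \cdot I(\gamma_j \ge \ell - 1) \,\Big|\, s(j)\Big]. \]

Let $\mathcal{F}_{\ell-1}$ denote the history until before the beginning of interval $I_j(\ell)$ (i.e. the values of $\theta_i(t)$ and the outcomes of playing the arms until the time step before the first time step of $I_j(\ell)$). Note that the value of random variable $I(\gamma_j \ge \ell - 1)$ is completely determined by $\mathcal{F}_{\ell-1}$. Therefore,

\[
\begin{aligned}
E\Big[\sum_{\ell=1}^{\gamma_j+1} V_j^{\ell,a} \,\Big|\, s(j)\Big] &= E\Big[\sum_{\ell=1}^{T} E\big[V_j^{\ell,a} \cdot I(\gamma_j \ge \ell - 1) \mid s(j), \mathcal{F}_{\ell-1}\big] \,\Big|\, s(j)\Big] \\
&= E\Big[\sum_{\ell=1}^{T} E\big[V_j^{\ell,a} \mid s(j), \mathcal{F}_{\ell-1}\big] \cdot I(\gamma_j \ge \ell - 1) \,\Big|\, s(j)\Big].
\end{aligned}
\]

Recall that $V_j^{\ell,a}$ is the number of contiguous steps $t$ for which $a$ is the best arm in saturated set $C(t)$ and iid variables $\theta_1(t)$ have value smaller than $\mu_a + \frac{\Delta_a}{2}$. Observe that given $s(j) = s$ and $\mathcal{F}_{\ell-1}$, $V_j^{\ell,a}$ is the length of an interval which ends when the value of an iid Beta$(s + 1, j - s + 1)$ distributed variable exceeds $\mu_a + \frac{\Delta_a}{2}$ (i.e., $M(t)$ happens), or if an arm other than $a$ becomes the best saturated arm, or if we reach time $T$. Therefore, given $s(j)$, $\mathcal{F}_{\ell-1}$, $V_j^{\ell,a}$ is stochastically dominated by $\min\{X(j, s(j), \mu_a + \frac{\Delta_a}{2}), T\}$, where recall that $X(j, s(j), y)$ was defined as the number of trials until an independent sample from Beta$(s + 1, j - s + 1)$ distribution exceeds $y$. That is, for all $a$,

\[ E\big[V_j^{\ell,a} \mid s(j), \mathcal{F}_{\ell-1}\big] \le E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j), \mathcal{F}_{\ell-1}\big] = E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big]. \]

Substituting, we get,

\[
\begin{aligned}
E\Big[\sum_{\ell=1}^{\gamma_j+1} V_j^{\ell,a} \,\Big|\, s(j)\Big] &\le E\Big[\sum_{\ell=1}^{T} E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big] \cdot I(\gamma_j \ge \ell - 1) \,\Big|\, s(j)\Big] \\
&= E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big] \cdot E\Big[\sum_{\ell=1}^{T} I(\gamma_j \ge \ell - 1) \,\Big|\, s(j)\Big] \\
&= E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big] \cdot E[\gamma_j + 1 \mid s(j)].
\end{aligned}
\]

This immediately implies,

\[ E\Big[\sum_{a=2}^{N} \Delta_a E\Big[\sum_{\ell=1}^{\gamma_j+1} V_j^{\ell,a} \,\Big|\, s(j)\Big]\Big] \le E\Big[\sum_{a=2}^{N} \Delta_a E\big[\min\{X(j, s(j), \mu_a + \tfrac{\Delta_a}{2}), T\} \mid s(j)\big] \cdot E[\gamma_j + 1 \mid s(j)]\Big]. \]

□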



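D Proof of Theorem 2: details

We continue the proof from the main body of the paper. For the first term in Equation (9),

\[
\begin{aligned}
E\Big[\frac{I(s(j_a^*) \le \lfloor y_a j_a^* \rfloor)}{F^B_{j_a^*+1,\,y_a}(s(j_a^*))}\Big]
&\le \sum_j E\Big[\frac{I(s(j) \le \lfloor y_a j \rfloor)}{F^B_{j+1,\,y_a}(s(j))}\Big] \\
&= \sum_j \sum_{s=0}^{\lfloor y_a j \rfloor} \frac{f^B_{j,\mu_1}(s)}{F^B_{j+1,\,y_a}(s)} \\
&\le \sum_j \frac{\mu_1}{\Delta'_a} e^{-D_a j} \le \frac{16}{\Delta_a^3},
\end{aligned}
\tag{19}
\]

where $\Delta'_a = \mu_1 - y_a = \Delta_a/2$, and $D_a$ is the KL-divergence between Bernoulli distributions with parameters $\mu_1$ and $y_a$. The penultimate inequality follows using (15) in the proof of Lemma 3 in Appendix C.2 with $\Delta' = \Delta'_a$, and $D = D_a$. The last inequality uses the geometric series sum (note that $D_a \ge 0$ by Gibbs' inequality),

\[ \sum_j e^{-D_a j} \le \frac{1}{1 - e^{-D_a}} \le \max\Big\{\frac{2}{D_a}, \frac{e}{e - 1}\Big\} \le \frac{2}{\min\{D_a, 1\}} \le \frac{2}{\Delta_a'^2} = \frac{8}{\Delta_a^2}. \]

And, for the second term, using the fact that $F^B_{j+1,y}(s) \ge (1 - y)F^B_{j,y}(s)$, and that for $s \ge \lceil yj \rceil$, $F^B_{j,y}(s) \ge 1/2$ (Fact 2),

\[ E\Big[\frac{I(s(j_a^*) \ge \lceil y_a j_a^* \rceil)}{F^B_{j_a^*+1,\,y_a}(s(j_a^*))}\Big] \le \frac{2}{1 - y_a} \le \frac{4}{\Delta_a}. \tag{20} \]

Substituting the bound from Equation (19) and (20) in Equation (9),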



E;=o E[E[7,>(j)]Ea3A,E[X(j,s(j),yJ|sO-)]] < (En + 4(iV - 1)) + 12). (21) 

Also, using Lemma 3 while substituting $y$ with $y_a = \mu_a + \frac{\Delta_a}{2}$ and $\Delta'$ with $\mu_1 - y_a = \frac{\Delta_a}{2}$,



T-l W 

EE(3Aa)E 

j=0 a=2 



E 



min{X(j, s(j),/ia + ^),T} s(j) 



16(ln T) 



< E(3Aa) E + E (3Aa) 

V^481nr 192 

< E^^ + — +48Aa. 



■ ^ 16(lnr) 
Substituting bounds from (21) and (22) in Equation (7),



(22) 



73+1 



T-l 

EE 

j=0 L£=l a 



EE^'"3A. 



< L„ + 4(iV - 1)) + 12) + + + 48Aa) 

n a " a " 

< 1152(lnr)(^ -^f + 288(lnr) ^ + 48(lnr) E + ^Q^A^E 7^ + ^^(^ " 1) 
Now, using the result that $\Pr(\overline{E(t)}) \le 4(N - 1)/T^2$ (by Lemma 4) with Equation (6), we can bound the total regret due to playing saturated arms as



EE 



■73+1 



t 

i ?' ^ 

+48(lnT) XI TT + ^^SiV^ + 96(7V - 1) + 8(iV - 1). 
Since an unsaturated arm $u$ becomes saturated after $L_u$ plays, regret due to unsaturated arms is at most

\[ E[\mathcal{R}^u(T)] \le \sum_{u=2}^{N} L_u \Delta_u = 24(\ln T)\Big(\sum_{u=2}^{N} \frac{1}{\Delta_u}\Big). \]

Summing the regret due to saturated and unsaturated arms, we obtain the result of Theorem 2.

The proof for the alternate bound in Remark 1 will essentially follow the same lines except that instead of dividing the interval $I_j(\ell)$ into subdivisions $V_j^{\ell,a}$, we will simply bound the regret due to saturated arms by number of plays times $\Delta_{\max}$. That is, we will use the bound,



7i+l 



To bound EE^t we follow the proof for bounding EE^V ^/^"] for 

a — argmaxi^i /i^, i.e., replacing 
jia with fia = maxi^i //i, and with Ami„. In a manner similar to LemmajS] we can obtain 



73+1 T^^.ai 



7j+l 



I^J Wl] ^ E[(Tj- + 1) min{X(j, MM 



A 



min 

2^ 



),r}]+E[XT./(i?(i))] 

teij 



And, consequently, using Equation (9), Equations (19)-(22), and Lemma 4, we can obtain a regret bound of $O\big(\frac{\Delta_{\max}}{\Delta_{\min}^3}\big(\sum_{a=2}^{N} \frac{1}{\Delta_a^2}\big) \ln T\big)$.